suppressPackageStartupMessages({
library(tidyverse)
library(dplyr)
library(plotly)
library(ggplot2)
library(maps)
library(mapdata)
library(RColorBrewer)
library(sf)
library(tigris)
library(leaflet)
library(leaflet.extras)
library(viridis)
library(tidyr)
library(tidytext)
library(wordcloud)
library(lubridate)
library(readr)
})
Imagine you are a data scientist at a respected media outlet – say, the “New York Times.” Your editor wants to support the writing of a feature article about “How Couples Meet and Stay Together,” and your editor-in-chief asks you to analyze data from an extensive survey conducted by a researcher at Stanford University.
Since there is no way all features of the data can be represented in such a memo, feel free to pick and choose a few patterns that make for a good story – outlining why they matter and presenting them in a visually pleasing way.
The full background and text of the story will be researched by a writer of the magazine – your input should be based on the data and some common sense (i.e. no need to read up on this). It does help, however, to briefly describe what you are presenting and what it highlights.
Provide polished plots that are refined enough to include in the magazine with very little further manipulation: already include variable descriptions (if necessary for understanding), titles, the source (e.g. “How Couples Meet and Stay Together (Rosenfeld, Reuben, Falcon 2018)”), and appropriate colors, fonts, etc., and make them understandable to the average reader of the “New York Times.” The design does not need to be NYTimes-like.
We will be using the 2017 wave of the HCMST survey - provided as HCMST_couples.rds. The file HCMST_variable_descriptions.pdf contains most of the variable descriptions and coding of responses.
dating_data <- readRDS("/Users/ulrika/Documents/GitHub/course_content/Exercises/03_dating_GRADED/HCMST_couples.rds")
Visualize how the mode of meeting for the first time has changed over the years.
Create two (2) charts in this section to highlight some important pattern(s). Make sure to use some variation in the type of visualization. Briefly discuss which visualization you recommend to your editor and why.
categories <- unique(dating_data$meeting_type)
print(categories)
## [1] Primary or Secondary School College
## [3] Military Church
## [5] Volunteer Organization Customer-Client Relationship
## [7] Bar or Restaurant Private Party
## [9] Internet Internet Dating or Phone App
## [11] Internet Social Network Online Gaming
## [13] Internet Chat Internet Site
## [15] Public Place Blind Date
## [17] On Vacation One-time Service Interaction
## [19] Business Trip Work Neighbors
## [21] Met Online
## 21 Levels: Bar or Restaurant Blind Date Business Trip Church ... Work Neighbors
# Simplify meeting types into five broad categories
# (note: the survey level is 'One-time Service Interaction' — matching is case-sensitive)
dating_data$meeting_type_simplified <- case_when(
dating_data$meeting_type %in% c('Primary or Secondary School', 'College') ~ 'educational',
dating_data$meeting_type %in% c('Work Neighbors', 'Customer-Client Relationship', 'One-time Service Interaction', 'Business Trip') ~ 'professional',
dating_data$meeting_type %in% c('Volunteer Organization', 'Military', 'Church') ~ 'community',
dating_data$meeting_type %in% c('Bar or Restaurant', 'Private Party', 'On Vacation', 'Public Place', 'Blind Date') ~ 'social',
dating_data$meeting_type %in% c('Internet', 'Internet Dating or Phone App', 'Internet Social Network', 'Online Gaming', 'Internet Chat', 'Internet Site', 'Met Online') ~ 'online',
TRUE ~ 'other')
data_aggregated <- dating_data %>%
count(Q21A_Year, meeting_type_simplified, name = "Frequency")
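One caveat worth flagging for recodes like this: string matching in `case_when()` is case-sensitive, so a level such as 'One-time Service Interaction' silently falls through to 'other' if the capitalization is off. A quick base-R check (the commented guard is a sketch against `dating_data` as loaded above):

```r
# %in% matches exactly, including case
"One-Time Service Interaction" %in% "One-time Service Interaction"  # FALSE
tolower("One-Time Service Interaction") %in%
  tolower("One-time Service Interaction")                           # TRUE

# Guard after recoding (sketch, using the document's data):
# table(dating_data$meeting_type_simplified)  # 'other' should be 0 or absent
```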
# Graph 1: Line Chart
data_aggregated$Q21A_Year <- as.numeric(as.character(data_aggregated$Q21A_Year))
## Warning: NAs introduced by coercion
min_year <- min(data_aggregated$Q21A_Year, na.rm = TRUE)
max_year <- max(data_aggregated$Q21A_Year, na.rm = TRUE)
year_breaks <- seq(from = min_year, to = max_year, by = 20)
ggplot(data_aggregated, aes(x = Q21A_Year, y = Frequency, color = meeting_type_simplified, group = meeting_type_simplified)) +
geom_line() +
labs(title = "How Couples Meet Over Time",
subtitle = "Trends in Meeting Venues",
x = "Year",
y = "Number of Couples") +
theme_minimal() +
scale_color_brewer(palette = "Set3") +
scale_x_continuous(breaks = year_breaks)
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Graph 2: stack area chart
ggplot(data_aggregated, aes(x = Q21A_Year, y = Frequency, fill = meeting_type_simplified)) +
geom_area(position = 'stack') +
labs(title = "Shift in How Couples Meet Over Time",
subtitle = "Changing Proportions of Meeting Types",
x = "Year",
y = "Number of Couples") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_align()`).
I would recommend the stacked area chart for the feature article because it provides a comprehensive view of the historical shifts in how couples meet. It not only shows the trend of each category over time but also how each category contributes to the total, allowing the reader to see the rise of online meetings in the context of other modes declining or remaining static.
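Since the recommendation is argued in terms of proportions while Graph 2 plots raw counts, a percent-stacked variant may serve the article even better: `geom_area(position = "fill")` rescales each year to 100%. A minimal sketch with illustrative toy values (swap in `data_aggregated` for the real chart):

```r
library(ggplot2)

# Toy counts standing in for data_aggregated (made-up values)
toy <- data.frame(
  year = rep(c(1980, 2000, 2017), each = 2),
  mode = rep(c("online", "offline"), times = 3),
  n    = c(1, 60, 15, 50, 45, 30)
)

# position = "fill" plots each mode's share within its year
ggplot(toy, aes(x = year, y = n, fill = mode)) +
  geom_area(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year", y = "Share of couples", fill = "Meeting mode") +
  theme_minimal()

# The shares that position = "fill" displays, computed explicitly
toy$share <- toy$n / ave(toy$n, toy$year, FUN = sum)
```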
Create one (1) visualization to show the relationship between a respondent’s age and their partner’s age, accounting for the gender of the respondent. Identify the main pattern in the graph via an annotation added directly to the plot.
dating_data %>%
filter(!is.na(ppage), !is.na(Q9), !is.na(ppgender)) %>%
ggplot(aes(x = ppage, y = Q9, color = ppgender)) +
geom_point(alpha = 0.6) +
scale_color_manual(values = c("Male" = "lightskyblue", "Female" = "pink1")) +
geom_smooth(method = "lm", se = FALSE, color = "red2") +
labs(title = "Relationship Between Respondent's and Partner's Age",
subtitle = "Accounting for Gender of the Respondent",
x = "Respondent's Age",
y = "Partner's Age",
color = "Gender") +
theme_minimal() +
annotate("text", x = Inf, y = 0,
label = "Across genders, there is a strong positive correlation\nbetween the ages of respondents and their partners,\nwith a tendency to partner with someone of a similar age.",
size = 3, color = "indianred3", hjust = 1, vjust = 0)
## `geom_smooth()` using formula = 'y ~ x'
Explore how the political affiliation of partners affects how couples meet and stay together.
Create two (2) charts in this section to highlight the information on politics and dating. Make sure to use some variation in the type of visualizations. Discuss which visualization you recommend to your editor and why.
# Graph 1: Bar Chart
dating_data %>%
filter(!is.na(partyid7), !is.na(meeting_type_simplified)) %>%
ggplot(aes(x = meeting_type_simplified, fill = partyid7)) +
geom_bar(position = "dodge") +
labs(title = "Meeting Modes by Political Affiliation",
x = "Meeting Mode",
y = "Count",
fill = "Political Affiliation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "RdBu")
# Graph 2: Heat Map
avg_duration <- dating_data %>%
filter(partyid7 != "Refused", w6_q12 != "Refused") %>%
group_by(partyid7, w6_q12) %>%
summarise(average_duration = mean(duration, na.rm = TRUE), .groups = 'drop')
ggplot(avg_duration, aes(x = partyid7, y = w6_q12, fill = average_duration)) +
geom_tile() +
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "YlOrRd")) +
labs(title = "Average Relationship Duration by Political Affiliation Pairings",
x = "Respondent's Political Group",
y = "Partner's Political Group",
fill = "Average Duration (days)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
I recommend the heat map for the feature article. This visualization goes beyond the initial meeting and dives into the dynamics of relationship duration within the context of political alignment, offering a deeper understanding of how shared or differing political views might influence relationship longevity. While the bar chart offers valuable insights into meeting modes, the heat map provides a more comprehensive view of relationship dynamics, which seems especially relevant for a feature article exploring the nuances of how couples meet and stay together in the context of political affiliations.
Choose 2 of the plots you created above and add interactivity. For at
least one of these interactive plots, this should not be done through
the use of ggplotly. Briefly describe to the editor why
interactivity in these visualizations is particularly helpful for a
reader.
interactive_line <- plot_ly(data_aggregated, x = ~Q21A_Year, y = ~Frequency,
type = 'scatter', mode = 'lines', color =
~meeting_type_simplified, colors =
RColorBrewer::brewer.pal(8, "Set3")) %>%
# plotly's layout has no 'subtitle' attribute; embed it in the title instead
layout(title = list(text = "How Couples Meet Over Time<br><sup>Trends in Meeting Venues</sup>"),
xaxis = list(title = "Year"),
yaxis = list(title = "Number of Couples"),
hovermode = "closest")
interactive_line
filtered_data <- dating_data %>%
filter(!is.na(ppage), !is.na(Q9), !is.na(ppgender))
interactive_scatter <- plot_ly(data = filtered_data, x = ~ppage, y = ~Q9,
type = 'scatter', mode = 'markers',
color = ~ppgender, colors = c('Male' = 'lightskyblue', 'Female' = 'pink1'),
marker = list(opacity = 0.6)) %>%
layout(title = "Relationship Between Respondent's and Partner's Age",
xaxis = list(title = "Respondent's Age"),
yaxis = list(title = "Partner's Age"),
hovermode = 'closest')
interactive_scatter
Interactivity allows readers to engage directly with the data in ways static charts do not. Users can hover over specific points to get precise information, such as the exact number of couples meeting through different venues each year or the exact ages of respondents and their partners. This direct engagement facilitates a deeper understanding of the data, enabling readers to discover insights that might not be immediately apparent from a static representation. Interactive elements empower readers to focus on aspects of the data that interest them most. For example, they can zoom in on specific time periods in the line chart to examine trends in greater detail or filter the scatter plot by gender to compare trends across different groups.
We are going to investigate severe weather events and their impact. Using data about the locations of weather events occurring in the United States, we want to better understand and visualize their spatial distribution.
NOAA’s National Centers for Environmental Information (NCEI) has data on all severe storm events.
severe_weather_data <- read_csv("/Users/ulrika/Documents/GitHub/course_content/Exercises/07_severe_weather_GRADED/data/storms.csv")
## Rows: 380137 Columns: 49
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): STATE, MONTH_NAME, EVENT_TYPE, CZ_TYPE, CZ_NAME, WFO, BEGIN_DATE_T...
## dbl (26): BEGIN_YEARMONTH, BEGIN_DAY, BEGIN_TIME, END_YEARMONTH, END_DAY, EN...
## lgl (5): CATEGORY, TOR_OTHER_WFO, TOR_OTHER_CZ_STATE, TOR_OTHER_CZ_FIPS, TO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Provide a static state-level choropleth map of the United States
visualizing where monetary damage is recorded (by using the sum of the
variables DAMAGE_PROPERTY_USD and
DAMAGE_CROPS_USD).
state_data <- severe_weather_data %>%
group_by(STATE) %>%
# sum each damage column separately so a missing value in one column
# does not drop the other column's recorded damage for that row
summarise(TOTAL_DAMAGE = sum(DAMAGE_PROPERTY_USD, na.rm = TRUE) +
sum(DAMAGE_CROPS_USD, na.rm = TRUE))
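Note that `sum(DAMAGE_PROPERTY_USD + DAMAGE_CROPS_USD, na.rm = TRUE)` and summing the two columns separately are not equivalent: a missing value in either column makes the pairwise sum NA, and `na.rm` then discards the recorded half too. A small base-R illustration:

```r
prop_dmg <- c(100, NA, 50)
crop_dmg <- c(10, 20, NA)

sum(prop_dmg + crop_dmg, na.rm = TRUE)                     # 110: drops both pairs containing an NA
sum(prop_dmg, na.rm = TRUE) + sum(crop_dmg, na.rm = TRUE)  # 180: keeps every observed value
```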
state_data$STATE <- tolower(state_data$STATE)
us_map <- map_data("state")
map_data <- merge(us_map, state_data, by.x = "region", by.y = "STATE")
state_labels <- map_data %>%
group_by(region) %>%
summarise(center_long = mean(long), center_lat = mean(lat), .groups = 'drop')
ggplot(data = map_data, aes(x = long, y = lat, group = group, fill = TOTAL_DAMAGE)) +
geom_polygon(color = "white") +
expand_limits(x = us_map$long, y = us_map$lat) +
scale_fill_viridis(option = "viridis", direction = -1, name = "Total Damage (USD)") +
geom_text(data = state_labels, inherit.aes = FALSE, aes(label = region, x = center_long, y = center_lat), size = 2, check_overlap = TRUE, colour="black")+
labs(title = "State-Level Monetary Damage from Storms in the US", x = "", y = "") +
theme_minimal() +
theme(axis.text = element_blank(),
axis.title = element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank())
Provide a static county-level choropleth map of the United States
visualizing where monetary damage is recorded (by using the sum of the
variables DAMAGE_PROPERTY_USD and
DAMAGE_CROPS_USD).
# CZ_FIPS is only unique within a state (and zone records are not counties),
# so keep county records and build the full 5-digit GEOID for the join
county_damage <- severe_weather_data %>%
filter(CZ_TYPE == "C") %>%
mutate(GEOID = sprintf("%02d%03d", STATE_FIPS, CZ_FIPS)) %>%
group_by(GEOID) %>%
summarise(Total_Damage = sum(DAMAGE_PROPERTY_USD, na.rm = TRUE) +
sum(DAMAGE_CROPS_USD, na.rm = TRUE))
counties_sf <- tigris::counties(cb = TRUE, class = "sf")
## Retrieving data for the year 2022
county_damage_sf <- counties_sf %>%
left_join(county_damage, by = "GEOID")
ggplot(data = county_damage_sf) +
geom_sf(aes(fill = Total_Damage), color = "snow4") +
scale_fill_viridis_c(option = "viridis", direction = -1,
na.value = "ivory", name = "Total Damage (USD)") +
labs(title = "County-Level Monetary Damage from Storms in the US") +
coord_sf(xlim = c(-125, -65), ylim = c(25, 50), expand = FALSE) +
theme_void() +
theme(plot.title = element_text(hjust = 0.5))
Create a leaflet map of the United States showing the
location of severe weather events which result in at least one death
(hint: use EVENT_TYPE). Ignore locations that fall outside
the United States. Provide at least three pieces of information on the
incident in a popup.
filtered_data <- severe_weather_data %>%
filter(DEATHS_DIRECT > 0 | DEATHS_INDIRECT > 0) %>%
select(EVENT_TYPE, BEGIN_LAT, BEGIN_LON, BEGIN_DATE_TIME, STATE, DEATHS_DIRECT, DEATHS_INDIRECT) %>%
na.omit()
leaflet(data = filtered_data) %>%
addTiles() %>%
addMarkers(
lng = ~BEGIN_LON, lat = ~BEGIN_LAT,
popup = ~paste(EVENT_TYPE, "<br>",
"Date: ", BEGIN_DATE_TIME, "<br>",
"State: ", STATE, "<br>",
"Direct Deaths: ", DEATHS_DIRECT, "<br>",
"Indirect Deaths: ", DEATHS_INDIRECT)
) %>%
addProviderTiles(providers$Esri.WorldImagery) %>%
setView(lng = -98.583333, lat = 39.833333, zoom = 4)
Start with the previous map. Now, distinguish the markers of the
weather event locations by EVENT_TYPE, i.e. what kind of
weather event occurred. If there are too many categories, collapse some
categories. Choose an appropriate coloring scheme to map the locations
by type of weather event. Add a legend informing the user about the
color scheme. Also make sure that the information about the type of
weather event is now contained in the popup information. Show this
map.
unique(filtered_data$EVENT_TYPE)
## [1] "Flood" "Thunderstorm Wind"
## [3] "Flash Flood" "Lightning"
## [5] "Tornado" "Heavy Rain"
## [7] "Hail" "Waterspout"
## [9] "Marine Thunderstorm Wind" "Marine Strong Wind"
## [11] "Debris Flow"
filtered_data$simplified_category <- dplyr::case_when(
filtered_data$EVENT_TYPE %in% c("Thunderstorm Wind", "Lightning", "Tornado", "Hail", "Marine Thunderstorm Wind", "Marine Strong Wind") ~ "Severe Storms",
filtered_data$EVENT_TYPE %in% c("Flood", "Flash Flood", "Debris Flow") ~ "Floods",
filtered_data$EVENT_TYPE == "Heavy Rain" ~ "Heavy Precipitation",
filtered_data$EVENT_TYPE %in% c("Waterspout") ~ "Marine Events",
TRUE ~ "Other"
)
color_vector <- c("#FFB6C1", "#40E0D0", "#FF7F50", "#98FB98")
unique_categories <- unique(filtered_data$simplified_category)
category_palette <- colorFactor(palette = color_vector, domain = unique_categories)
leaflet(data = filtered_data) %>%
addTiles() %>%
addCircleMarkers(
lng = ~BEGIN_LON, lat = ~BEGIN_LAT,
color = ~category_palette(simplified_category),
popup = ~paste(EVENT_TYPE, "<br>",
"Category: ", simplified_category, "<br>",
"Date: ", BEGIN_DATE_TIME, "<br>",
"State: ", STATE, "<br>",
"Direct Deaths: ", DEATHS_DIRECT, "<br>",
"Indirect Deaths: ", DEATHS_INDIRECT),
radius = 5,
fillOpacity = 0.8
) %>%
addLegend("bottomright", pal = category_palette, values = ~simplified_category,
title = "Event Category",
opacity = 1) %>%
addProviderTiles(providers$Esri.WorldImagery) %>%
setView(lng = -98.583333, lat = 39.833333, zoom = 4)
Add marker clustering, so that zooming in will reveal the individual locations but the zoomed out map only shows the clusters. Show the map with clusters.
leaflet(data = filtered_data) %>%
addTiles() %>%
addMarkers(
lng = ~BEGIN_LON, lat = ~BEGIN_LAT,
clusterOptions = markerClusterOptions(),
popup = ~paste(EVENT_TYPE, "<br>",
"Category: ", simplified_category, "<br>",
"Date: ", BEGIN_DATE_TIME, "<br>",
"State: ", STATE, "<br>",
"Direct Deaths: ", DEATHS_DIRECT, "<br>",
"Indirect Deaths: ", DEATHS_INDIRECT)
) %>%
addLegend("bottomright", pal = category_palette, values = ~simplified_category,
title = "Event Category",
opacity = 1) %>%
addProviderTiles(providers$Esri.WorldImagery) %>%
setView(lng = -98.583333, lat = 39.833333, zoom = 4)
Explore the cinematic world through the lens of movie scripts, analyzing narrative structures, character prominence, and thematic elements across genres and periods. This assignment involves dissecting movie scripts to uncover patterns and trends, employing data visualization techniques to present findings.
We will work with a dataset of approximately 1000 movie scripts and their metadata. This includes the movie’s title, release date, a brief overview, and parsed script files distinguishing dialogue, character information, and scene descriptions. The data was scraped with Aveek Saha’s code.
tagged <- read_csv("/Users/ulrika/Documents/GitHub/course_content/Exercises/09_moviescripts_GRADED/data/tagged.csv.gz", show_col_types = FALSE)
metadata <- read_csv("/Users/ulrika/Documents/GitHub/course_content/Exercises/09_moviescripts_GRADED/data/metadata.csv", show_col_types = FALSE)
Analyze the dialogue content of movie scripts. Transform the dialogue into a tidy data frame, breaking down the text into individual words, removing common stop words and other unnecessary elements. As needed, use the cleaning functions introduced in lecture (or write your own in addition) to remove unnecessary words (stop words), syntax, punctuation, numbers, white space, etc. Visualize the 20 most frequently used words in the dialogues to gain insights into the core thematic elements of the scripts.
dialogues <- tagged %>%
filter(Tag == "Dialogue")
tidy_dataframe <- dialogues %>%
mutate(Content = str_replace_all(tolower(Content), "[^\\w\\s']", " "),
Content = str_squish(Content)) %>%
unnest_tokens(word, Content) %>%
anti_join(stop_words, by = "word") %>%
count(word, sort = TRUE)
tidy_dataframe %>%
slice_max(n, n = 20) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = sqrt(n))) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_viridis_c(direction = -1) +
labs(x = "Word", y = "Frequency", title = "Top 20 Most Common Words in Movie Script Dialogues") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 12),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.line = element_line(color = "grey", linewidth = 0.5),
plot.margin = margin(1, 1, 1, 1, "cm")
)
Create word clouds for a selection of movies to visualize the most prevalent themes or phrases. Choose a set of movies, either randomly or based on specific criteria, and use their script dialogues to generate the clouds.
tidy_dataframe_with_id <- dialogues %>%
select(movie_id, Content) %>%
mutate(Content = tolower(Content),
Content = gsub("[^a-z']", " ", Content),
Content = gsub("\\s+", " ", Content)) %>%
unnest_tokens(word, Content) %>%
anti_join(stop_words, by = "word") %>%
group_by(movie_id, word) %>%
summarise(n = n(), .groups = 'drop') %>%
arrange(desc(n))
set.seed(123)
unique_movie_ids <- unique(tidy_dataframe_with_id$movie_id)
selected_movie_ids <- sample(unique_movie_ids, 2)
for (id in selected_movie_ids) {
# use `id` (not `movie_id`) so the filter compares the column to the loop
# value — filter(movie_id == movie_id) compares the column to itself and
# keeps every row, producing identical clouds for all movies
movie_dialogues <- tidy_dataframe_with_id %>%
filter(movie_id == id) %>%
select(word, n)
cat("Word Cloud", id, "\n")
wordcloud(
words = movie_dialogues$word,
freq = movie_dialogues$n,
min.freq = 10,
max.words = 150,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, "Dark2"),
scale = c(4, 0.5)
)
cat("\n---\n")
}
## Word Cloud Monkeybone_parsed
##
## ---
## Word Cloud Ninotchka_parsed
##
## ---
Using this list of profanities calculate a profanity score for each movie, indicating how often these words were used in the script. Visualize the Top 10 movies with most profanity, and show how the use of profanity has changed over time (using the movie release date).
profanities_link <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanities <- readLines(con = profanities_link, warn = FALSE)
median_popularity <- median(metadata$popularity, na.rm = TRUE)
tidy_dataframe_with_id <- tidy_dataframe_with_id %>%
mutate(movie_name = sub("_parsed", "", movie_id, fixed = TRUE))
metadata <- metadata %>%
mutate(success = ifelse(popularity >= median_popularity, "Successful", "Unsuccessful"))
metadata <- metadata %>%
mutate(movie_name = sub("_parsed_parsed_parsed", "", file_name, fixed = TRUE))
merged_data <- tidy_dataframe_with_id %>%
inner_join(metadata, by = "movie_name")
profanity_movies <- merged_data %>%
filter(word %in% profanities)
profanity_scores <- profanity_movies %>%
group_by(movie_name) %>%
# sum the per-word counts; n() would count distinct profane words, not uses
summarize(profanity_score = sum(n), .groups = 'drop')
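A note on the aggregation: `tidy_dataframe_with_id` carries one row per (movie, word) pair with its count `n`, so `n()` counts *distinct* profane words while `sum(n)` counts total uses, which is what “how often” calls for. A base-R toy showing the difference:

```r
# One row per word with its per-movie count, as in tidy_dataframe_with_id
counts <- data.frame(word = c("damn", "hell"), n = c(7, 3))

nrow(counts)    # 2: distinct profane words (what n() would return)
sum(counts$n)   # 10: total uses across the script
```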
profanity_scores <- profanity_scores %>%
inner_join(metadata %>% select(movie_name, release_date, success), by = "movie_name")
top_10_profanity <- profanity_scores %>%
arrange(desc(profanity_score)) %>%
slice_head(n = 10)
top_10_profanity$movie_name_modified <- gsub("-", " ", top_10_profanity$movie_name)
# Top 10 Movies with Most Profanity
ggplot(top_10_profanity, aes(x = reorder(movie_name_modified, profanity_score), y = profanity_score, fill = profanity_score)) +
geom_col() +
coord_flip() +
scale_fill_viridis_c(direction = -1) +
labs(x = "Movie Name", y = "Profanity Score", title = "Top 10 Movies by Profanity Score") +
theme_minimal() +
theme(legend.position = "none")
# Profanity Over Time
profanity_scores$year <- format(as.Date(profanity_scores$release_date), "%Y")
profanity_over_time <- profanity_scores %>%
group_by(year) %>%
summarize(total_profanity = sum(profanity_score), .groups = 'drop') %>%
arrange(year)
profanity_over_time$year_numeric <- as.numeric(profanity_over_time$year)
# Generate the plot with adjustments
ggplot(profanity_over_time, aes(x = year_numeric, y = total_profanity)) +
geom_line() +
geom_point(color = "royalblue") + # A shade of blue for the dots
scale_x_continuous(breaks = seq(min(profanity_over_time$year_numeric, na.rm = TRUE),
max(profanity_over_time$year_numeric, na.rm = TRUE), by = 5),
labels = function(x) as.character(x)) +
labs(x = "Year", y = "Total Profanity Score", title = "Profanity in Movies Over Time") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Examine the impact of script simplicity on its success by calculating a readability score (Flesch Reading Ease, Flesch-Kincaid, or other measures) for the scripts. Analyze and visualize the relationship between the readability of the scripts and their IMDb vote average, providing commentary on your findings.
library(quanteda)
## Package version: 3.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
dialogues <- dialogues %>%
filter(!is.na(Content))
readability_scores <- dialogues %>%
group_by(movie_id) %>%
summarize(script = paste(Content, collapse = " ")) %>%
mutate(flesch_score = textstat_readability(script, measure = "Flesch")[, "Flesch"])
readability_scores <- readability_scores %>%
mutate(movie_name = sub("_parsed", "", movie_id, fixed = TRUE))
movie_readability <- readability_scores %>%
inner_join(metadata, by = "movie_name") %>%
# each movie has one readability score and one vote average, so the
# genre split is not needed here; summarize per movie directly
group_by(movie_id) %>%
summarize(mean_flesch_score = mean(flesch_score, na.rm = TRUE),
mean_vote_average = mean(vote_average, na.rm = TRUE))
ggplot(movie_readability, aes(x = mean_flesch_score, y = mean_vote_average)) +
geom_point(shape = 19, size = 2, color = "skyblue3") +
geom_smooth(method = "lm", se = FALSE, color = "tan1", linewidth = 1) +
labs(x = "Flesch Reading Ease Score", y = "IMDb Vote Average", title = "Script Readability vs. IMDb Vote Average") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12),
panel.grid.major = element_line(color = "snow3"),
panel.grid.minor = element_blank(),
plot.background = element_rect()
)
## `geom_smooth()` using formula = 'y ~ x'
A higher Flesch Reading Ease score indicates a text that is easier to read. I found a positive correlation between the Flesch score and IMDb vote average, which suggests that more readable scripts tend to be associated with higher-rated movies.
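To attach a number to the claimed correlation, `cor.test()` on the per-movie summary would report Pearson's r and a p-value, e.g. `cor.test(movie_readability$mean_flesch_score, movie_readability$mean_vote_average)`. A self-contained illustration of the mechanics with made-up values (not from the scripts data):

```r
flesch <- c(60, 70, 80, 90)      # toy readability scores
votes  <- c(5.5, 6.0, 6.8, 7.1)  # toy vote averages

ct <- cor.test(flesch, votes)
round(unname(ct$estimate), 2)    # Pearson r, ~0.99 for these toy values
```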
Now, use the NRC Word-Emotion Association Lexicon in the
tidytext package to identify a larger set of emotions
(e.g. anger, anticipation, disgust, fear, joy, sadness, surprise,
trust). Visualize the relationship between the use of words from these
categories and the movie genre. What is your finding?
dialogues <- dialogues %>%
mutate(movie_name = sub("_parsed", "", movie_id, fixed = TRUE))
tidy_dialogues <- dialogues %>%
mutate(Content = str_replace_all(tolower(Content), "[^\\w\\s']", " "),
Content = str_squish(Content)) %>%
unnest_tokens(word, Content) %>%
anti_join(stop_words, by = "word")
combined_data <- tidy_dialogues %>%
inner_join(metadata, by = "movie_name") %>%
mutate(genres = str_split(genres, ",\\s*")) %>%
unnest(genres)
combined_data <- combined_data %>% filter(!is.na(genres))
nrc_lexicon <- get_sentiments("nrc")
tidy_emotions <- combined_data %>%
inner_join(nrc_lexicon, by = "word") %>%
# keep the eight NRC emotions; drop the broad positive/negative categories
filter(!sentiment %in% c("positive", "negative"))
emotions_by_genre <- tidy_emotions %>%
count(genres, sentiment) %>%
group_by(genres) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
ggplot(emotions_by_genre, aes(x = genres, y = prop, fill = sentiment)) +
geom_col() +
labs(title = "Proportion of Emotions by Genre",
x = "Genre", y = "Proportion",
fill = "Emotion") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis_d(option = "plasma")
For this visualization, I used proportions instead of raw counts to facilitate a more direct comparison of how different genres tend to express different emotions. My analysis revealed that surprise and joy are among the least commonly expressed emotions across the various movie genres. While the proportions of emotions varied slightly across genres, I observed a general consistency in the distribution of emotions. However, there were slightly more positive feelings, such as joy and trust, observed in genres like romance, history, and TV movies, and slightly more negative feelings, such as disgust, observed in genres like horror.